24-Hour News Cycles Are Exhausting, so Let's Automate
What is Web Scraping?
Web scraping is a great way to automate data collection in R. Luckily, it's also pretty easy to scrape information from HTML pages using the code below. Let's get started.
Step 1: Load the packages!
library(rvest) # Handles the actual scraping
library(kableExtra) # Extends kable() with extra table styling
library(knitr) # Provides kable() for rendering tables
library(dplyr) # Bestows upon us the piping '%>%' operator
library(compiler) # Compiles our nice functions
library(DT) # Makes our nice widget
Step 2: Write the functions for scraping from our selected news sources
We select the CSS elements corresponding to the bodies of text we want to scrape, and each function takes the URL of the specific page we want to scrape as its input.
As I only wrote this to read some headlines, I've limited the length of each object to 10 so that the dataframe won't throw any errors if the lengths of the objects differ. Feel free to change this.
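If you would rather keep everything a source returns instead of truncating, one alternative (a sketch in base R, not part of the original code; the function name is my own) is to pad each vector out to a common length with NA before building the dataframe:

```r
# Pad a character vector to a fixed length with NA (base R)
pad_to <- function(x, n = 10) {
  length(x) <- n # Assigning length() fills new slots with NA, or truncates if x is longer
  x
}

pad_to(c("headline one", "headline two")) # Returns a length-10 vector padded with NA
```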
### Function for scraping Twitter
fetch_news_Twitter <- function(url){
  webpage <- read_html(url)
  newsdata <- html_nodes(webpage, '.tweet-text')
  text_data <- html_text(newsdata)
  head(text_data, 10) # Keep at most 10 headlines so column lengths match
}
### Function for scraping Reddit
fetch_news_Reddit <- function(url){
  webpage <- read_html(url)
  newsdata <- html_nodes(webpage, '.outbound')
  text_data <- html_text(newsdata)
  head(text_data, 10) # Keep at most 10 headlines so column lengths match
}
### Function for scraping RT
fetch_news_RT <- function(url){
  webpage <- read_html(url)
  newsdata <- html_nodes(webpage, '.link_hover')
  text_data <- html_text(newsdata)
  head(text_data, 10) # Keep at most 10 headlines so column lengths match
}
Step 3: Let's compile the functions we wrote above to make them extra speedy
### Compile all functions to byte code
fetch_news_Twitter <- cmpfun(fetch_news_Twitter)
fetch_news_Reddit <- cmpfun(fetch_news_Reddit)
fetch_news_RT <- cmpfun(fetch_news_RT)
Step 4: Create the data objects containing the bodies of text that we scraped
If you want to write something similar, you can copy this code and change the URLs and object names; it will work as long as the CSS elements we used in Step 2 are correct.
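Sites change their markup over time, so a selector that worked when this was written may now match nothing and silently return an empty vector. A quick sanity check you can run before wiring a selector into one of the functions above (a sketch; the function name is my own, and the example URL/selector are just the ones from Step 2):

```r
library(rvest)

# Report how many elements a CSS selector matches on a page,
# and preview the first few matches
check_selector <- function(url, css) {
  page <- read_html(url)
  hits <- html_text(html_nodes(page, css))
  message(length(hits), " elements matched '", css, "'")
  head(hits, 3)
}

# e.g. check_selector('https://old.reddit.com/r/worldnews/new/', '.outbound')
```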
### All Twitter Information ###
# All China related news from Twitter
china_data <- fetch_news_Twitter('https://twitter.com/search?f=news&q=china&src=typd')
# All Tunisia related news from Twitter
tunis_data <- fetch_news_Twitter('https://twitter.com/search?f=news&q=tunisia&src=typd')
# All general news data from Twitter
news_data <- fetch_news_Twitter('https://twitter.com/search?f=news&q=news&src=typd')
# All Guardian news from Twitter
guard_data <- fetch_news_Twitter('https://twitter.com/guardian?lang=en')
# All BBC news from Twitter
bbc_data <- fetch_news_Twitter('https://twitter.com/bbc?lang=en')
# All SBS news from Twitter
sbs_data <- fetch_news_Twitter('https://twitter.com/SBSNews?lang=en')
### All Reddit Information ###
# All World News from Reddit
reddit_world_data <- fetch_news_Reddit('https://old.reddit.com/r/worldnews/new/')
# All News from Reddit
redditnews_data <- fetch_news_Reddit('https://old.reddit.com/r/news/new/')
### All Other Sources of News ###
#RT news
rt_newsdata <- fetch_news_RT('https://www.rt.com/news/')
Step 5: Use the data objects above to initialize the final dataframe
# Make the first column of the dataframe that DT will render
tweets <- data.frame(guard_data)
name <- c("The Guardian", "China News", "Twitter News", "BBC News", "SBS News", "Reddit World News", "Reddit General News", "RT News", "Tunisia News")
Step 6: Finalize the dataframe
# Add the remaining columns of scraped text to the dataframe
tweets <- tweets %>%
mutate(china_data = china_data,
news_data = news_data,
bbc_data = bbc_data,
guard_data = guard_data,
sbs_data = sbs_data,
reddit_world_data = reddit_world_data,
redditnews_data = redditnews_data,
rt_newsdata = rt_newsdata,
tunis_data = tunis_data)
names(tweets) <- name
Step 7: Style the widget we'll use to actually read the scraped data
You can also make this an R script: assign the datatable() call below to an object, then add saveWidget(yourWidget, "nameOfThisWidget.html") as the last line, so that after the code executes the widget is saved by itself in an HTML file.
This widget also lets you download the data as a CSV file, and it has search functionality, which is a nice touch if you're planning on scraping large volumes of data.
datatable(tweets,
extensions = 'Buttons',
width = 1500,
filter = 'top',
class = "display",
options = list(autoWidth = TRUE,
scrollX = FALSE,
dom = 'Bfrtip',
buttons = c('copy', 'csv', 'excel', 'pdf', 'print'),
initComplete = JS("function(settings, json) {",
"$('body').css({'font-family': 'Helvetica'});",
"$(this.api().table().header()).css({'font-family': 'Helvetica', 'background-color': '#3BEFFF', 'color': '#000'});",
"}"))) %>%
formatStyle("The Guardian", color = 'black', backgroundColor = '#FFF5A6') %>%
formatStyle('China News', color = 'black', backgroundColor = '#FFE9BF') %>%
formatStyle("Twitter News", color = 'black', backgroundColor = '#FFF5A6') %>%
formatStyle("BBC News", color = 'black', backgroundColor = '#FFE9BF') %>%
formatStyle("SBS News", color = 'black', backgroundColor = '#FFF5A6') %>%
formatStyle('Reddit World News', color = 'black', backgroundColor = '#FFE9BF') %>%
formatStyle('Reddit General News', color = 'black', backgroundColor = '#FFF5A6') %>%
formatStyle('RT News', color = 'black', backgroundColor = '#FFE9BF') %>%
formatStyle('Tunisia News', color = 'black', backgroundColor = '#FFF5A6')
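Following the note in Step 7, saving the widget to a standalone HTML file might look like this (a sketch; the object and file names are placeholders, and selfcontained = TRUE requires pandoc to be installed):

```r
library(DT)
library(htmlwidgets)

# Assign the datatable() call above to an object instead of printing it,
# then write it out as a standalone HTML page
news_widget <- datatable(tweets)
saveWidget(news_widget, "nameOfThisWidget.html", selfcontained = TRUE)
```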